require(mosaic)
require(mosaicData)
require(lmtest)
cols =  trellis.par.get()$superpose.symbol$col

Fitting a Logistic model

For a binary response, instead of modeling \(\pi\) (the response variable) like this, \[ \pi = \beta_0 + \beta_1 X \]

suppose we modeled it like this, \[ \log \left( \frac{\pi}{1-\pi} \right) = logit(\pi) = \beta_0 + \beta_1 X \]

This transformation is called the logit function. What are the properties of this function? Note that this implies that \[ \pi = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \in (0,1) \]

The logit function constrains the fitted values to line within \((0,1)\), which helps to give a natural interpretation as the probability of the response actually being 1.

Fitting a logistic curve is mathematically more complicated than fitting a least squares regression, but the syntax in R is similar, as is the output. The procedure for fitting is called maximum likelihood estimation, and the usual machinery for the sum of squares breaks down. Consequently, there is no notion of \(R^2\), etc. We use the function glm() (for Generalized Linear Model) to fit a logistic regression.

data(Whickham)
Whickham = Whickham %>%
  mutate(isAlive = 2 - as.numeric(outcome))
logm = glm(isAlive ~ age + smoker, data=Whickham, family=binomial)
summary(logm)
## 
## Call:
## glm(formula = isAlive ~ age + smoker, family = binomial, data = Whickham)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2795  -0.4381   0.2228   0.5458   1.9581  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.599221   0.441231  17.223   <2e-16 ***
## age         -0.123683   0.007177 -17.233   <2e-16 ***
## smokerYes   -0.204699   0.168422  -1.215    0.224    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1560.32  on 1313  degrees of freedom
## Residual deviance:  945.02  on 1311  degrees of freedom
## AIC: 951.02
## 
## Number of Fisher Scoring iterations: 6
myplot = xyplot(jitter(isAlive) ~ age, groups=smoker, data=Whickham, alpha=0.5, pch=19, cex=2, ylab="isAlive")
fit.outcome = makeFun(logm)
plotFun(fit.outcome(age=x, smoker="Yes") ~ x, lwd=3, plot=myplot, add=TRUE)

plotFun(fit.outcome(age=x, smoker="No") ~ x, col=cols[2], lwd=3, add=TRUE)

  1. How can we interpret the coefficients of this model in the context of the problem?
  2. How does the smoker term affect the model?
Binning

Another way to make sense of a binary response variable is to bin the explanatory variable and then compute the average proportion of the response within each bin.

Whickham = Whickham %>%
  mutate(ageGroup = cut(age, breaks=10))
favstats(~isAlive | ageGroup, data=Whickham)
##       ageGroup min Q1 median Q3 max      mean        sd   n missing
## 1  (17.9,24.6]   0  1      1  1   1 0.9763780 0.1524700 127       0
## 2  (24.6,31.2]   0  1      1  1   1 0.9790576 0.1435679 191       0
## 3  (31.2,37.8]   0  1      1  1   1 0.9567901 0.2039597 162       0
## 4  (37.8,44.4]   0  1      1  1   1 0.9097222 0.2875796 144       0
## 5    (44.4,51]   0  1      1  1   1 0.8013245 0.4003310 151       0
## 6    (51,57.6]   0  0      1  1   1 0.7109375 0.4551083 128       0
## 7  (57.6,64.2]   0  0      1  1   1 0.6071429 0.4898456 168       0
## 8  (64.2,70.8]   0  0      0  0   1 0.2282609 0.4220114  92       0
## 9  (70.8,77.4]   0  0      0  0   1 0.1382979 0.3470634  94       0
## 10 (77.4,84.1]   0  0      0  0   0 0.0000000 0.0000000  57       0

Although this is not the preferred method for performing logistic regression, it can be illustrative to see how the logistic curve fits through this series of points.

# print(myplot)
binned.y = mean(~isAlive | ageGroup, data=Whickham)
binned.x = mean(~age | ageGroup, data=Whickham)
binplot = xyplot(binned.y ~ binned.x, cex=2, pch=19, col="orange", lwd=3)
plotFun(fit.outcome(age=x, smoker="Yes") ~ x, lwd=3, add=TRUE, plot=binplot)

plotFun(fit.outcome(age=x, smoker="No") ~ x, col=cols[2], lwd=3, add=TRUE)

Odds Ratios and Interpretation of Coefficients

The interpretation of the coefficients in a linear regression model are clear based on an understanding of the geometry of the regression model. We use the terms intercept and slope because a simple linear regression model is a line. In a simple logistic model, the line is transformed by the logit function. How do the coefficients affect the shape of the curve in a logistic model?

See this shiny app http://45.55.32.181/shiny/log_app/ to experiment with changes to the intercept and slope coefficients in a simple logistic model.

  1. How do changes in the intercept term affect the shape of the logistic curve?
  2. How do changes in the slope term affect the shape of the logistic curve?

We saw earlier that the link values are linear with respect to the explanatory variable. The link values are the \(\log\) of the odds. Note that if an event occurs with proability \(\pi\), then \[ odds = \frac{\pi}{1-\pi}, \qquad \pi = \frac{odds}{1+odds}. \] Note that while \(\pi \in [0,1]\), \(odds \in (0,\infty)\). Thus, we can interpret \(\hat{\beta_1}\) as the typical change in \(\log{(odds)}\) for each one unit increase in the explanatory variable. More naturally, the odds of success are multiplied by \(e^{\hat{\beta_1}}\) for each one unit increase in the explanatory variable, since this is the odds ratio. \[ \begin{aligned} odds_X &= \frac{\hat{\pi}_X}{1 - \hat{\pi}_X} = e^{\hat{\beta}_0 + \hat{\beta}_1 X} \\ odds_{X+1} &= \frac{\hat{\pi}_{X+1}}{1 - \hat{\pi}_{X+1}} = e^{\hat{\beta}_0 + \hat{\beta}_1 (X + 1)} \\ \frac{odds_{X+1}}{odds_X} &= \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 (X + 1)}}{e^{\hat{\beta}_0 + \hat{\beta}_1 X}} = e^{\hat{\beta}_1} \end{aligned} \] Furthermore, since the logits are linear with respect to the explanatory variable, these changes are constant.

Finding confidence intervals for the odds ratio is easy. We could look at the point estimate and standard error from the summary() and use the qnorm() function to find a critical z value, or we could just use the convenience function confint().

exp(confint(logm))
## Waiting for profiling to be done...
##                   2.5 %       97.5 %
## (Intercept) 870.1646058 4916.0147645
## age           0.8708442    0.8957213
## smokerYes     0.5845122    1.1318806

Assessing a Logistic Fit

Three of the conditions we require for linear regression have analogs for logistic models:

However, the requirements for Constant Variance and Normality are no longer applicable. In the first case, the variability in the response now inherently depends on the value, so we know we won’t have constant variance. In the second case, there is no reason to think that the residuals will be normally distributed, since the “residuals” are can only be computed in relation to 0 or 1. So in both cases the properties of a binary response variable break down the assumptions we made previously.

Moreover, since we don’t have any sums of squares, we can’t use \(R^2\), ANOVA, or \(F\)-tests. Instead, since we fit the model using maximum likelihood estimation, we compute the likelihood of our model. \[ L(y_i) = \begin{cases} \hat{\pi} & \text{if } y_i=1 \\ 1-\hat{\pi} & \text{if } y_i=0 \end{cases},\qquad L(model) = \prod_{i=1}^n L(y_i) \] Because these numbers are usually very small (why?), it is more convenient to speak of the log-likelihood \(\log(L)\), which is always negative. A larger \(\log(L)\) is closer to zero and therefore a better fit.

The log-likelihood is easy to retrieve

logLik(logm)
## 'log Lik.' -472.5098 (df=3)

but is nearly as easy to calculate directly.

pi= fitted.values(logm) 
likelihood = ifelse(Whickham$isAlive == 1, pi, 1 - pi)
log(prod(likelihood))
## [1] -472.5098

The closest thing to an analog of the \(F\)-test is the Likelihood Ratio Test (LRT). Here, our goal is to compare the log-likelihoods of two models: the one we build vs. the constant model. This is similar to the way we compared the sum of the squares explained by a linear regression model to the model that consists solely of the grand mean.

The null hypothesis in the LRT is that \(\beta_1 = \beta_2 = \cdots = \beta_k = 0\). The alternative hypothesis is that at least one of these coefficients is non-zero. The test statistic is: \[ G = -2\log(\text{constant model}) - (-2 \log(\text{model})). \] These two quantities are known as deviances. It can be shown that \(G\) follows a \(\chi^2\) distribution with \(k\) degrees of freedom. This allows us to compute a \(p\)-value for the model.

lrtest(logm)
## Likelihood ratio test
## 
## Model 1: isAlive ~ age + smoker
## Model 2: isAlive ~ 1
##   #Df  LogLik Df Chisq Pr(>Chisq)    
## 1   3 -472.51                        
## 2   1 -780.16 -2 615.3  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this sense the LRT has obvious similarities to the ANOVA table and \(F\)-test. In the same way that we previously performed a nested \(F\)-test to assess the usefulness of a group of predictors, we can perform a nested LRT.

Adding interaction or quadratic terms works in much the same way as it did with linear regression.

linteract = glm(isAlive ~ age + smoker + age*smoker, data=Whickham, family=binomial)
summary(linteract)
## 
## Call:
## glm(formula = isAlive ~ age + smoker + age * smoker, family = binomial, 
##     data = Whickham)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3983  -0.4256   0.2163   0.5598   1.9283  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    8.169231   0.606600  13.467   <2e-16 ***
## age           -0.133231   0.009953 -13.386   <2e-16 ***
## smokerYes     -1.457843   0.837232  -1.741   0.0816 .  
## age:smokerYes  0.022235   0.014495   1.534   0.1250    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1560.32  on 1313  degrees of freedom
## Residual deviance:  942.68  on 1310  degrees of freedom
## AIC: 950.68
## 
## Number of Fisher Scoring iterations: 6
  1. Interpret the coefficients of this model, and find confidence intervals for them.

Suppose now that we suspect that there are diminishing returns to the extent to which being alive is associated with a person’s age. We can easily add quadratic terms.

lquad = glm(isAlive ~ age + smoker + age*smoker + I(age^2) + I(age^2):smoker, data=Whickham, family=binomial)
summary(lquad)
## 
## Call:
## glm(formula = isAlive ~ age + smoker + age * smoker + I(age^2) + 
##     I(age^2):smoker, family = binomial, data = Whickham)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7524  -0.2808   0.2680   0.5172   2.1367  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.9879917  1.4181591   2.107 0.035122 *  
## age                 0.0737044  0.0571073   1.291 0.196832    
## smokerYes           1.5039169  2.1311556   0.706 0.480386    
## I(age^2)           -0.0019389  0.0005542  -3.499 0.000467 ***
## age:smokerYes      -0.0915658  0.0866460  -1.057 0.290611    
## smokerYes:I(age^2)  0.0010145  0.0008558   1.185 0.235879    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1560.32  on 1313  degrees of freedom
## Residual deviance:  928.94  on 1308  degrees of freedom
## AIC: 940.94
## 
## Number of Fisher Scoring iterations: 6
  1. Interpret the coefficients of this model, and find confidence intervals for them.

How can we assess whether these terms are warranted? Just like the nested \(F\)-test, the nested LRT gives us information about the incremental contribution of a set of terms to our model.

lrtest(logm, linteract, lquad)
## Likelihood ratio test
## 
## Model 1: isAlive ~ age + smoker
## Model 2: isAlive ~ age + smoker + age * smoker
## Model 3: isAlive ~ age + smoker + age * smoker + I(age^2) + I(age^2):smoker
##   #Df  LogLik Df   Chisq Pr(>Chisq)   
## 1   3 -472.51                         
## 2   4 -471.34  1  2.3391   0.126163   
## 3   6 -464.47  2 13.7443   0.001036 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Experiment with different models until you find one that you think fits the data best. Justify your answer.
print(myplot)

fit.qalive = makeFun(lquad)
plotFun(fit.outcome(age=x, smoker="Yes") ~ x, add=TRUE, lty=2)

plotFun(fit.outcome(age=x, smoker="No") ~ x, col=cols[2], add=TRUE, lty=2)

plotFun(fit.qalive(age=x, smoker="Yes") ~ x, add=TRUE)

plotFun(fit.qalive(age=x, smoker="No") ~ x, col=cols[2], add=TRUE)

Two-Way Tables and Chi-Squared Tests

Consider the logistic model for \(Acceptance\) to medical school based on \(Sex\). This is a simple logistic regression model in which both the response and explanatory variable are binary.

MedGPA = read.csv("http://www.math.smith.edu/~bbaumer/mth247/MedGPA.csv")
logm = glm(Acceptance ~ Sex, data=MedGPA, family=binomial)
summary(logm)
## 
## Call:
## glm(formula = Acceptance ~ Sex, family = binomial, data = MedGPA)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.435  -1.084   0.940   0.940   1.274  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.5878     0.3944   1.490    0.136
## SexM         -0.8109     0.5528  -1.467    0.142
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 75.791  on 54  degrees of freedom
## Residual deviance: 73.594  on 53  degrees of freedom
## AIC: 77.594
## 
## Number of Fisher Scoring iterations: 4

As we can see, the \(p\)-value for the coefficient of \(Sex\) is not significant at the 10% level.

Since both variables in this case are binary, an essentially equivalent model is to simply compute a 2-way table.

two.way = tally(~Acceptance | Sex, data=MedGPA, format="count")
two.way
##           Sex
## Acceptance  F  M
##          0 10 15
##          1 18 12

Note that the logistic model only has two possible imputs (and therefore two possible outputs), and that they match the entries in the 2-way table.

fit.accept = makeFun(logm)
fit.accept(Sex="M")
##         1 
## 0.4444444
fit.accept(Sex="F")
##         1 
## 0.6428571

Furthermore, the odds ratio from the 2-way table is the same as the odds ratio from the logistic model.

oddsRatio(two.way)
## [1] 2.25
# Since the coefficients is negative, we add a negative here to match the 2-way table
exp(-coef(logm))
## (Intercept)        SexM 
##   0.5555556   2.2500000

When we built the logistic regression model, we obtained a \(p\)-value for the coefficient of \(Sex\). This was helpful, because it gave us some information about the extent of the statistical significance of the relationship between \(Sex\) and \(Acceptance\). Can we obtain something similar from the 2-way table? Indeed, the \(\chi^2\)-test gives essentially the same information.

chisq.test(two.way[1:2,1:2], correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  two.way[1:2, 1:2]
## X-squared = 2.1825, df = 1, p-value = 0.1396

Classification

We already learned how to compare models using the deviance, but how do we know how well our model works? One techinque for assessing the goodness-of-fit in a logistic regression model is to examine the percentage of the time that our model was “right.”

Recall the logistic model that we built for \(isAlive\) as a function of \(age\) and \(smoker\).

Whickham = Whickham %>%
  mutate(isAlive = 2 - as.numeric(outcome))
logm = glm(isAlive ~ age + smoker, data=Whickham, family=binomial)
summary(logm)
## 
## Call:
## glm(formula = isAlive ~ age + smoker, family = binomial, data = Whickham)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2795  -0.4381   0.2228   0.5458   1.9581  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.599221   0.441231  17.223   <2e-16 ***
## age         -0.123683   0.007177 -17.233   <2e-16 ***
## smokerYes   -0.204699   0.168422  -1.215    0.224    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1560.32  on 1313  degrees of freedom
## Residual deviance:  945.02  on 1311  degrees of freedom
## AIC: 951.02
## 
## Number of Fisher Scoring iterations: 6

When we visualize the model, it’s hard to assess how often our model is correct.

plotPoints(jitter(isAlive) ~ age, groups=smoker, data=Whickham, alpha=0.3, pch=19, cex=2,
           ylab="Probability of Being Alive (units)",
           xlab="Age (years)", main="Whickham Study Outcomes",
           sub=paste("Number of Cases = ", nrow(Whickham)),
           auto.key=TRUE)

fit.outcome = makeFun(logm)
plotFun(fit.outcome(age=x, smoker="Yes") ~ x, add=TRUE)

plotFun(fit.outcome(age=x, smoker="No") ~ x, col=cols[2], add=TRUE)

Moreover, what does it even mean to be correct? The response variable is binary, but the predictions generated by the model are quantities on \([0,1]\). A simple way to classify the fitted values of our model is to simply round them off. Once we do this, we can tabulate how often the rounded-off probability from the model agrees with the actual response variable.

Whickham = Whickham %>%
  mutate(fitted = fitted.values(logm)) %>%
  mutate(fit.alive = ifelse(fitted >= 0.5, 1, 0))
tally(~isAlive | fit.alive, data=Whickham)
##        fit.alive
## isAlive   0   1
##       0 251 118
##       1  94 851
tbl = tally(~isAlive | fit.alive, data=Whickham, format="count")
sum(diag(tbl)) / nrow(Whickham)
## [1] 0.8386606

Thus, our model was correct nearly 84% of the time. Is that good? It depends on the application…

We might want to compare this with what would happen if we used the null model, just predicting every value by the mean

mean(~isAlive, data=Whickham)
## [1] 0.7191781

or if we had just flipped a coin instead of using a model at all.

Whickham = Whickham %>%
   mutate(fit.alive = sample(c(0,1), size=1314, replace=TRUE))
tally(~isAlive | fit.alive, data=Whickham)
##        fit.alive
## isAlive   0   1
##       0 182 187
##       1 480 465
  1. Which of these “models” is best?

Concordance

A more sophisticated technique for assessing the quality of fit is to examine the \(C\)-statistic, where \(C\) stands for concordance. The idea here is to pair up each actual success with each actual failure, and then compute whether the fitted probability for the success was greater than the fitted probability for the failure. The percentage of pairs statisfying this condition makes up the \(C\)-statistic.

Note that even if our model had no predictive power, it would still be right about half of the time. Thus, the \(C\)-statistic for two randomly generated vectors is about 0.5.

X = data.frame(a = runif(10000), b = runif(10000))
require(Hmisc)
rcorrcens(a~b, data=X)
## 
## Somers' Rank Correlation for Censored Data    Response variable:a
## 
##       C   Dxy  aDxy    SD    Z      P     n
## b 0.503 0.006 0.006 0.007 0.96 0.3382 10000

In this case our model is much more effective.

rcorrcens(isAlive ~ fitted.values(logm), data=Whickham)
## 
## Somers' Rank Correlation for Censored Data    Response variable:isAlive
## 
##                         C   Dxy  aDxy   SD    Z P    n
## fitted.values(logm) 0.893 0.786 0.786 0.02 39.6 0 1314